Linear regression is a statistical method for fitting a line to data, where the relationship between two variables \(x\) and \(y\) can be modeled by a straight line with some error:
\[y = \beta_0 + \beta_1 x + \epsilon\]
\(\beta_0\) = y-intercept
\(\beta_1\) = slope
\(\epsilon\) = error term (a.k.a. residual)
Introduction to linear regression
When we use \(x\) to predict \(y\), we usually call:
\(x\) the explanatory, predictor or independent variable,
\(y\) the response or dependent variable
As we move forward in this chapter, we will learn about:
criteria for line-fitting
the uncertainty associated with estimates of model parameters
World’s most useless regression
Possums…?
x <- read.csv("possum.csv")
head(x)
site pop sex age head_l skull_w total_l tail_l
1 1 Vic m 8 94.1 60.4 89.0 36.0
2 1 Vic f 6 92.5 57.6 91.5 36.5
3 1 Vic f 6 94.0 60.0 95.5 39.0
4 1 Vic f 6 93.2 57.1 92.0 38.0
5 1 Vic f 2 91.5 56.3 85.5 36.0
6 1 Vic f 1 93.1 54.8 90.5 35.5
World’s most useless regression
lm1 <- lm(head_l ~ total_l, x)
coefficients(lm1)
(Intercept) total_l
42.7097931 0.5729013
World’s most useless regression
summary(x$total_l)
Min. 1st Qu. Median Mean 3rd Qu. Max.
75.00 84.00 88.00 87.09 90.00 96.50
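The fitted coefficients can be plugged into the line equation to make a prediction. A quick sketch at the median total length (88 cm), using the coefficient estimates printed above:

```r
# Prediction from the fitted line: head_l = 42.71 + 0.573 * total_l
# (coefficient values copied from the coefficients(lm1) output)
b <- c(42.7097931, 0.5729013)      # (Intercept), total_l
pred_head_l <- b[1] + b[2] * 88    # predict head length at total_l = 88
pred_head_l                        # about 93.1 mm
```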
Residuals are the leftover variation after accounting for the model fit
The residual for the \(i^{th}\) observation (\(x_i, y_i\)) is the difference of the observed response (\(y_i\)) and the response that we would predict based on the model fit (\(\hat{y_i}\))
Mathematically, we want a line that minimizes the magnitude of residuals
Most commonly, this is done by minimizing the sum of the squared residuals
\[\underset{b_0,\, b_1}{\arg\min}\; e_1^2 + e_2^2 + \cdots + e_n^2\]
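The least squares fit is the line that makes this sum of squared residuals as small as possible: any other choice of intercept or slope gives a larger sum. A minimal sketch on simulated data (the dataset here is illustrative, not from the slides):

```r
# Compare the residual sum of squares of the least squares fit
# to a line with a slightly perturbed slope, on simulated data
set.seed(1)
x <- runif(50, 0, 10)
y <- 3 + 2 * x + rnorm(50)
fit <- lm(y ~ x)
rss <- function(b0, b1) sum((y - (b0 + b1 * x))^2)
rss_ls  <- rss(coef(fit)[1], coef(fit)[2])        # RSS of the LS line
rss_alt <- rss(coef(fit)[1], coef(fit)[2] + 0.1)  # perturbed slope
rss_ls < rss_alt   # TRUE: the least squares line has the smaller RSS
```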
Least squares regression
Conditions for the least squares line
Linearity: The data should show a linear trend
Normal residuals: Generally, the residuals must be nearly normal. When this condition is found to be unreasonable, it is usually because of outliers or concerns about influential points
Constant variability: The variability of points around the least squares line remains roughly constant
Independent observations: Be cautious about applying regression to time series data, which are sequential observations in time such as a stock price each day
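The first three conditions are usually checked with residual plots. A minimal sketch on simulated data (model and data are illustrative, not from the slides):

```r
# Illustrative residual diagnostics for a fitted model
set.seed(42)
x <- runif(100, 0, 10)
y <- 1 + 0.5 * x + rnorm(100)
fit <- lm(y ~ x)
r <- residuals(fit)
plot(fitted(fit), r)  # linearity & constant variability: no pattern, even spread
hist(r)               # normal residuals: roughly bell-shaped
# independence is a property of how the data were collected, not of any plot
```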
You might recall the point-slope form of a line from math class, which we can use to find the model fit, including the estimate of \(\beta_0\)
\[y - \bar{y} = b_1 \times (x - \bar{x})\]
To find the y-intercept, set \(x = 0\)
\[b_0 = \bar{y} - b_1 \bar{x}\]
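These formulas reproduce exactly what `lm()` computes. A sketch on simulated data, assuming the standard slope formula \(b_1 = R \times s_y / s_x\) (the correlation scaled by the ratio of standard deviations):

```r
# The point-slope formulas recover lm()'s coefficient estimates
set.seed(7)
x <- rnorm(40, mean = 5)
y <- 2 + 3 * x + rnorm(40)
b1 <- cor(x, y) * sd(y) / sd(x)   # slope from correlation and SDs
b0 <- mean(y) - b1 * mean(x)      # intercept from b0 = ybar - b1*xbar
fit <- lm(y ~ x)
all.equal(unname(coef(fit)), c(b0, b1))   # TRUE
```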
Least squares regression
Intercept
(mean_y - b1*mean_x)/1000
family_income
24.31979
coefficients(lm2)[1]; b0
(Intercept)
24.31933
(Intercept)
24.31933
Least squares regression
Interpretation: Slope
What do these regression coefficients mean?
The slope describes the estimated difference in the \(y\) variable if the explanatory variable \(x\) for a case happened to be one unit larger.
For each additional $1,000 of family income, we would expect a student to receive a net difference of $1,000×(−0.0431) = −$43.10 in aid on average, i.e. $43.10 less.
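The arithmetic above in code, with the slope in $1,000s of aid per $1,000 of income as in the text:

```r
# Change in predicted aid for a $1,000 increase in family income
b1 <- -0.0431        # slope from the fitted aid model (value from the text)
1000 * b1            # -43.1, i.e. $43.10 less aid on average
```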
Least squares regression
Interpretation: Intercept
What do these regression coefficients mean?
The intercept describes the average outcome of \(y\) if \(x = 0\) and the linear model is valid all the way to \(x = 0\), which in many applications is not the case.
The estimated intercept \(b_0 = \$24{,}319\) describes the average aid if a student’s family had no income.
We must be cautious in this interpretation: while there is a real association, we cannot interpret a causal connection between the variables.
Least squares regression
Extrapolation
Applying a model estimate to values outside of the realm of the original data is called extrapolation.
If we extrapolate, we are making an unreliable bet that the approximate linear relationship will be valid in places where it has not been analyzed.
as.numeric(b0 + b1*1000)*1000
[1] -18752.32
Least squares regression
Strength of a fit: R-squared
We evaluated the strength of the linear relationship between two variables earlier using the correlation, \(R\).
However, it is more common to explain the strength of a linear fit using \(R^2\)
If provided with a linear model, we might like to describe how closely the data cluster around the linear fit.
About 25% of the variation in gift aid can be explained by differences in family income.
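For simple linear regression, \(R^2\) is literally the squared correlation, which is how `summary()` reports the strength of fit. A sketch on simulated data (not the aid data):

```r
# R-squared equals the squared correlation in simple linear regression
set.seed(3)
x <- rnorm(100)
y <- 0.5 * x + rnorm(100)
fit <- lm(y ~ x)
r_squared <- summary(fit)$r.squared
all.equal(r_squared, cor(x, y)^2)   # TRUE
```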
Least squares regression
Example 3: Categorical Predictors
x <- read.csv("mariokart.csv")
y <- data.frame(new   = ifelse(x$cond == "new", 1, 0),
                price = x$total_pr)
summary(y)
new price
Min. :0.0000 Min. : 28.98
1st Qu.:0.0000 1st Qu.: 41.17
Median :0.0000 Median : 46.50
Mean :0.4126 Mean : 49.88
3rd Qu.:1.0000 3rd Qu.: 53.99
Max. :1.0000 Max. :326.51
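With a 0/1 indicator as the predictor, the regression coefficients have a simple reading: the intercept is the mean of the baseline group, and the slope is the difference in group means. A sketch on simulated price data (not the actual Mario Kart file):

```r
# Regression on a 0/1 indicator recovers the group means
set.seed(9)
y <- data.frame(new   = rep(c(0, 1), each = 50),
                price = c(rnorm(50, 45, 5), rnorm(50, 53, 5)))
fit <- lm(price ~ new, data = y)
coef(fit)[1]   # mean price of used games (new = 0)
coef(fit)[2]   # (mean price of new) minus (mean price of used)
means <- tapply(y$price, y$new, mean)
all.equal(unname(coef(fit)), unname(c(means[1], means[2] - means[1])))  # TRUE
```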
We might wonder, is this convincing evidence that the “true” linear model has a negative slope?
That is, do the data provide strong evidence that the political theory is accurate, i.e. that the unemployment rate is a useful predictor of midterm election outcomes?
\(H_0: \beta_1 = 0\)
\(H_A: \beta_1 \neq 0\)
Inference for linear regression
Example: Midterm elections & unemployment
sd_x <- sd(x$unemp)
sd_y <- sd(x$house_change)
n <- nrow(x)
z <- list()
for(i in 1:10000){
  set.seed(i)
  y <- data.frame(x = rnorm(n, mean = 0, sd = sd_x),
                  y = rnorm(n, mean = 0, sd = sd_y))
  z[[length(z) + 1]] <- coefficients(lm(y ~ x, y))[2]
}
z <- unlist(z)
dz <- density(z)
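The simulated slopes form a null distribution; the natural next step is to compare an observed slope against it to get a two-sided p-value. A self-contained synthetic sketch, where `b1_obs` is a hypothetical placeholder for the slope fit to the actual election data:

```r
# Turning a simulated null distribution of slopes into a p-value
# (all values here are synthetic stand-ins)
set.seed(1)
z_null <- replicate(2000, coef(lm(rnorm(30) ~ rnorm(30)))[2])  # null slopes
b1_obs <- -1.0                         # hypothetical observed slope
p_value <- mean(abs(z_null) >= abs(b1_obs))  # two-sided p-value
p_value
```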